A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language

نویسندگان

Sajjad Ahmad Khan

Waqas Anwar

Usama Ijaz Bajwa

Xuan Wang

چکیده

Stemming is a procedure that conflates morphologically related terms into a single term without doing complete morphological analysis. Urdu language raises several challenges to Natural Language Processing (NLP) largely due to its rich morphology. The core tool of information retrieval (IR) is a Stemmer which reduces a word to its stem form. Due to the diverse nature of Urdu, developing its stemmer for an IR system is a challenging task. This paper presents a light weight stemmer for Urdu text, which uses rule based approach. Exceptional lists are developed to enhance the accuracy of the stemmer. The result of the stemmer is quite enough and can be effective in IR system.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Challenges of Urdu Named Entity Recognition: A Scarce Resourced Language

In this study, we present a brief overview of Named Entity Recognition (NER) system, various approaches followed for NER systems and finally NER systems for Urdu language. Urdu language raises several challenges to Natural Language Processing (NLP) largely due to its rich morphology. Research against NER systems in Urdu language is at infancy stage therefore the focus of this study is on challe...

متن کامل

Challenges in Developing a Rule based Urdu Stemmer

Urdu language raises several challenges to Natural Language Processing (NLP) largely due to its rich morphology. In this language, morphological processing becomes particularly important for Information Retrieval (IR). The core tool of IR is a Stemmer which reduces a word to its stem form. Due to the diverse nature of Urdu, developing stemmer is a challenging task. In Urdu, there are large numb...

متن کامل

N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language

Extraction of named entities (NEs) from the text is an important operation in many natural language processing applications like information extraction, question answering, machine translation etc. Since early 1990s the researchers have taken greater interest in this field and a lot of work has been done regarding Named Entity Recognition (NER) in different languages of the world. Unfortunately...

متن کامل

A Rule based Stemming Method for Multilingual Urdu Text

Urdu is a national language of Pakistan and spoken more than 200 million people use it as a verbal and written communication. There exists a large amount of unstructured Urdu textual data in the world; by applying data mining techniques useful information can be achieved. However it seriously lacks processing capabilities to develop innovative systems based on Urdu language. In this paper, auth...

متن کامل

Rule Based Urdu Stemmer

This paper presents Rule based Urdu Stemmer. In this technique rules are applied to remove suffix and prefix from the inflected words. Urdu is well spoken language all over the world but less work has been done on Urdu stemming. Stemmer helps us to find the root of the inflected word. Various possibilities of inflected words like ںو (vao+noon-gunna), ے (badi-ye), ںای (choti-ye+alif+noon-gunna) ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2012

A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language

نویسندگان

چکیده

منابع مشابه

Challenges of Urdu Named Entity Recognition: A Scarce Resourced Language

Challenges in Developing a Rule based Urdu Stemmer

N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language

A Rule based Stemming Method for Multilingual Urdu Text

Rule Based Urdu Stemmer

عنوان ژورنال:

اشتراک گذاری